Goto

Collaborating Authors

 trait score


Context is Enough: Empirical Validation of $\textit{Sequentiality}$ on Essays

Sunny, Amal, Gupta, Advay, Sreekumar, Vishnu

arXiv.org Artificial Intelligence

Recent work has proposed using Large Language Models (LLMs) to quantify narrative flow through a measure called sequentiality, which combines topic and contextual terms. A recent critique argued that the original results were confounded by how topics were selected for the topic-based component, and noted that the metric had not been validated against ground-truth measures of flow. That work proposed using only the contextual term as a more conceptually valid and interpretable alternative. In this paper, we empirically validate that proposal. Using two essay datasets with human-annotated trait scores, ASAP++ and ELLIPSE, we show that the contextual version of sequentiality aligns more closely with human assessments of discourse-level traits such as Organization and Cohesion. While zero-shot prompted LLMs predict trait scores more accurately than the contextual measure alone, the contextual measure adds more predictive value than both the topic-only and original sequentiality formulations when combined with standard linguistic features. Notably, this combination also outperforms the zero-shot LLM predictions, highlighting the value of explicitly modeling sentence-to-sentence flow. Our findings support the use of context-based sequentiality as a validated, interpretable, and complementary feature for automated essay scoring and related NLP tasks.


Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs

Luo, Junjie, Han, Rui, Welivita, Arshana, Di, Zeleikun, Wu, Jingfu, Zhi, Xuzhe, Agarwal, Ritu, Gao, Gordon

arXiv.org Artificial Intelligence

Interpersonal and professional qualities of physicians profoundly shape patient trust, communication, adherence, and health outcomes [1, 2]. Understanding these qualities from the patient's perspective is essential to advancing patient-centered care, yet current measurement tools--such as standardized surveys or aggregate star ratings--capture only a narrow view of the physician-patient relationship. In parallel, millions of online physician reviews now provide an abundant, patient-generated record of real-world experiences, offering an unprecedented opportunity to examine how physicians are perceived in everyday practice [3, 4, 5, 6]. Extracting clinically meaningful information from such narrative data remains challenging. Prior studies have typically relied on sentiment analysis or topic modeling, approaches that overlook the multidimensional nature of patient perceptions. Well-established frameworks from psychology, such as the Big Five personality traits [7], offer interpretable constructs for describing interpersonal style, but have rarely been operationalized at scale in healthcare settings [8]. Similarly, healthcare-specific qualities--communication effectiveness, perceived competence, attentiveness to outcomes, and trustworthiness--are widely recognized as central to care quality but are difficult to measure systematically. Manual coding of these traits is costly, inconsistent, and infeasible for national datasets. Recent advances in large language models (LLMs) enable a new approach [9].


Evaluating LLM Alignment on Personality Inference from Real-World Interview Data

Zhu, Jianfeng, Maharjan, Julina, Li, Xinyu, Coifman, Karin G., Jin, Ruoming

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM "personas" using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI's text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.


Scaling Personality Control in LLMs with Big Five Scaler Prompts

Cho, Gunhee, Cheong, Yun-Gyung

arXiv.org Artificial Intelligence

We present Big5-Scaler, a prompt-based framework for conditioning large language models (LLMs) with controllable Big Five personality traits. By embedding numeric trait values into natural language prompts, our method enables fine-grained personality control without additional training. We evaluate Big5-Scaler across trait expression, dialogue generation, and human trait imitation tasks. Results show that it induces consistent and distinguishable personality traits across models, with performance varying by prompt type and scale. Our analysis highlights the effectiveness of concise prompts and lower trait intensities, providing a efficient approach for building personality-aware dialogue agents.


Teach-to-Reason with Scoring: Self-Explainable Rationale-Driven Multi-Trait Essay Scoring

Do, Heejin, Ryu, Sangwon, Lee, Gary Geunbae

arXiv.org Artificial Intelligence

Multi-trait automated essay scoring (AES) systems provide a fine-grained evaluation of an essay's diverse aspects. While they excel in scoring, prior systems fail to explain why specific trait scores are assigned. This lack of transparency leaves instructors and learners unconvinced of the AES outputs, hindering their practical use. To address this, we propose a self-explainable Rationale-Driven Multi-trait automated Essay scoring (RaDME) framework. RaDME leverages the reasoning capabilities of large language models (LLMs) by distilling them into a smaller yet effective scorer. This more manageable student model is optimized to sequentially generate a trait score followed by the corresponding rationale, thereby inherently learning to select a more justifiable score by considering the subsequent rationale during training. Our findings indicate that while LLMs underperform in direct AES tasks, they excel in rationale generation when provided with precise numerical scores. Thus, RaDME integrates the superior reasoning capacities of LLMs into the robust scoring accuracy of an optimized smaller model. Extensive experiments demonstrate that RaDME achieves both accurate and adequate reasoning while supporting high-quality multi-trait scoring, significantly enhancing the transparency of AES.


Exploring the Potential of Large Language Models to Simulate Personality

Molchanova, Maria, Mikhailova, Anna, Korzanova, Anna, Ostyakova, Lidiia, Dolidze, Alexandra

arXiv.org Artificial Intelligence

With the advancement of large language models (LLMs), the focus in Conversational AI has shifted from merely generating coherent and relevant responses to tackling more complex challenges, such as personalizing dialogue systems. In an effort to enhance user engagement, chatbots are often designed to mimic human behaviour, responding within a defined emotional spectrum and aligning to a set of values. In this paper, we aim to simulate personal traits according to the Big Five model with the use of LLMs. Our research showed that generating personality-related texts is still a challenging task for the models. As a result, we present a dataset of generated texts with the predefined Big Five characteristics and provide an analytical framework for testing LLMs on a simulation of personality skills.


Autoregressive Score Generation for Multi-trait Essay Scoring

Do, Heejin, Kim, Yunsu, Lee, Gary Geunbae

arXiv.org Artificial Intelligence

Recently, encoder-only pre-trained models such as BERT have been successfully applied in automated essay scoring (AES) to predict a single overall score. However, studies have yet to explore these models in multi-trait AES, possibly due to the inefficiency of replicating BERT-based models for each trait. Breaking away from the existing sole use of encoder, we propose an autoregressive prediction of multi-trait scores (ArTS), incorporating a decoding process by leveraging the pre-trained T5. Unlike prior regression or classification methods, we redefine AES as a score-generation task, allowing a single model to predict multiple scores. During decoding, the subsequent trait prediction can benefit by conditioning on the preceding trait scores. Experimental results proved the efficacy of ArTS, showing over 5% average improvements in both prompts and traits.


Personality Trait Recognition using ECG Spectrograms and Deep Learning

Altaf, Muhammad Mohsin, Khan, Saadat Ullah, Majd, Muhammad, Anwar, Syed Muhammad

arXiv.org Artificial Intelligence

This paper presents an innovative approach to recognizing personality traits using deep learning (DL) methods applied to electrocardiogram (ECG) signals. Within the framework of detecting the big five personality traits model encompassing extra-version, neuroticism, agreeableness, conscientiousness, and openness, the research explores the potential of ECG-derived spectrograms as informative features. Optimal window sizes for spectrogram generation are determined, and a convolutional neural network (CNN), specifically Resnet-18, and visual transformer (ViT) are employed for feature extraction and personality trait classification. The study utilizes the publicly available ASCERTAIN dataset, which comprises various physiological signals, including ECG recordings, collected from 58 participants during the presentation of video stimuli categorized by valence and arousal levels. The outcomes of this study demonstrate noteworthy performance in personality trait classification, consistently achieving F1-scores exceeding 0.9 across different window sizes and personality traits. These results emphasize the viability of ECG signal spectrograms as a valuable modality for personality trait recognition, with Resnet-18 exhibiting effectiveness in discerning distinct personality traits.


Review of feedback in Automated Essay Scoring

Jong, You-Jin, Kim, Yong-Jin, Ri, Ok-Chol

arXiv.org Artificial Intelligence

The first automated essay scoring system was developed 50 years ago. Automated essay scoring systems are developing into systems with richer functions than the previous simple scoring systems. Its purpose is not only to score essays but also as a learning tool to improve the writing skill of users. Feedback is the most important aspect of making an automated essay scoring system useful in real life. The importance of feedback was already emphasized in the first AES system. This paper reviews research on feedback including different feedback types and essay traits on automated essay scoring. We also reviewed the latest case studies of the automated essay scoring system that provides feedback.


Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring

Do, Heejin, Kim, Yunsu, Lee, Gary Geunbae

arXiv.org Artificial Intelligence

Automated essay scoring (AES) aims to score essays written for a given prompt, which defines the writing topic. Most existing AES systems assume to grade essays of the same prompt as used in training and assign only a holistic score. However, such settings conflict with real-education situations; pre-graded essays for a particular prompt are lacking, and detailed trait scores of sub-rubrics are required. Thus, predicting various trait scores of unseen-prompt essays (called cross-prompt essay trait scoring) is a remaining challenge of AES. In this paper, we propose a robust model: prompt- and trait relation-aware cross-prompt essay trait scorer. We encode prompt-aware essay representation by essay-prompt attention and utilizing the topic-coherence feature extracted by the topic-modeling mechanism without access to labeled data; therefore, our model considers the prompt adherence of an essay, even in a cross-prompt setting. To facilitate multi-trait scoring, we design trait-similarity loss that encapsulates the correlations of traits. Experiments prove the efficacy of our model, showing state-of-the-art results for all prompts and traits. Significant improvements in low-resource-prompt and inferior traits further indicate our model's strength.